Koustuv Sinha

A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures

Feb 03, 2026
Via arXiv

Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training

Sep 30, 2025
Via arXiv

A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs

Jun 11, 2025
Via arXiv

CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models

Jun 11, 2025
Via arXiv

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Jun 11, 2025
Via arXiv

Multi-Modal Language Models as Text-to-Image Model Evaluators

May 01, 2025
Via arXiv

Scaling Language-Free Visual Representation Learning

Apr 01, 2025
Via arXiv

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

Dec 18, 2024
Via arXiv

VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning

Oct 04, 2024
Via arXiv

Efficient Tool Use with Chain-of-Abstraction Reasoning

Jan 30, 2024
Via arXiv